Adelaide’s population increased from 1.1 million to 1.3 million residents between 2006 and 2016, with 66 million more kilometers traveled on the road network during that time. Infrastructure Australia paints a dire picture of road congestion in Adelaide and expects it to keep worsening in the coming years, in line with both an increasing population and an increasing reliance on public transport in comparison to cars. The report estimated the annualized cost of road congestion for Greater Adelaide at approximately $1.4 billion in 2016 and projected it to rise to $2.6 billion in 2031 (Infrastructure Australia, 2019).
With this backdrop in mind, the client, the South Australia Department for Infrastructure and Transport (DIT), holds an untapped wealth of traffic data collected through Bluetooth probes, which count passing motor vehicles at a particular time and location, thereby producing a metric for road congestion.
This data will be examined in conjunction with publicly available, historical real time bus trip updates collected through General Transit Feed Specification Realtime (GTFSR), which provides the arrival time for each stop on a bus’s trip. The analysis aims to identify the relationship between bus travel times and road congestion, and the robustness of the former to the latter, on road segments of interest.
The aim of the proposed analysis is to investigate the extent of the relationship between bus travel times and road congestion - as measured by motor vehicle travel times - on identified road segments, where a strong relationship indicates a road segment where the bus travel times are less robust to congestion.
Initially, the bus performance metric to be used and applied was the average delay experienced by a bus trip on the segment of interest, as measured by a stop’s predicted arrival time versus the scheduled arrival time. However, this was later revised to measuring the bus travel time between the first and last stops of a segment, removing the possibility that we are measuring how accurately the schedule predicts and/or buffers for congestion. The road congestion metric used is the average travel time of the vehicles across the segment.
A proposal outlining the analysis, the objectives, and the methodology was created and sent to the client. This was followed by a discussion with the client to provide more information regarding the analysis and to clarify any points the client raised. Ultimately, an agreement was reached for the analysis to fulfill the following objectives:
- Detailed travel time or congestion analysis comparing public transport response to road traffic on selected sections of road over a given period of time
- Repeatable methodology, code, functions, and visuals that produce detailed analysis on other segments of interest
In fulfilling the first objective, the segments of road analysed are South Road and Marion Road in Adelaide. This report uses the former to illustrate the methodology, while the latter is used for comparison. The period of time chosen is March 2022.
Regarding the second objective, the methodology and the code are designed to require as little manual input and editing as possible when applied to different road segments.
The analysis undertaken in this report will form the basis of future analysis into:
- additional road segments of interest, to generate a ranking of bus network robustness that can help inform the allocation of resources
- the factors that can affect bus travel times, such as the use of bus lanes, the number of bus stops, traffic lights, etc.
- a model predicting bus travel times using identified features.
Three main data sources are used: DIT Addinsight, GTFS, and GTFSR. These data sources and their associated sub-sources are outlined below. All the data is stored in the cloud using Amazon Web Services (AWS) and is accessed through Athena, which uses standard SQL syntax.
The data cleaning and wrangling will also be discussed, as it was the part of the analysis that required the most work.
As mentioned above, the methodology will be illustrated on South Road, which is one of Adelaide’s major roads and regularly suffers from congestion (Infrastructure Australia, 2019).
Figure 4.1: South Rd on map. Source: Google Maps
GTFS is a common format developed by Google and used by public transport agencies around the world. It contains static or scheduled information about public transport services, such as routes, stops, schedules, and geographic transit information. For the purposes of this analysis, only the bus routes and bus stops datasets will be used.
These are the bus routes that go through South Road. They were identified by overlaying all the network routes on a map in Tableau, then highlighting the routes on South Road and exporting them to a list. The dataset simply contains the unique route_id values on the segment.
The list of bus stops on the segment was identified in the same fashion as the routes. This produces a list of the stop_id values on the segment.
A dataset containing all the stops in the bus network, along with information relating to each stop, is used and filtered to only the stops present on the segment. For the purposes of this report, the file containing all the stops on the network was pre-filtered to the stops on the segment to accommodate GitHub file size limits. However, the code and methodology contained here apply as if the complete dataset were used and filtered through the code.
| Variable | Description |
|---|---|
| stop_id | Unique stop identifier |
| stop_name | Name of the location. Uses a name that people will understand |
| stop_desc | Address of the stop |
| stop_lat | Latitude of the stop |
| stop_lon | Longitude of the stop |
| direction | Road direction of the stop |
The direction variable is manually created. In this case, if the stop is on the east side of South Road, then it is southbound (SB) away from the city; if the stop is on the west side of South Road, then it is northbound (NB) towards the city.
The bus stops will be plotted on a map to confirm they are all, in fact, on South Road.
Figure 4.2: Bus stops on South Road
Unlike GTFS which provides static information, GTFSR provides real time information consisting of two types. The first type is a trip’s real time updates regarding a bus stop’s expected arrival times and delays. The second type is a real time update of a bus’s geographic position and speed at a specific point in time. This analysis uses the former only.
Once the bus routes that go through the segment were identified as outlined above, the real time updates for all the trips on those routes in March 2022 were retrieved from the AWS database using Athena. This dataset is used to derive the bus travel time through the segment, which is the first element in the relationship being assessed in this analysis, with the other being the vehicle travel time as a measure of congestion. The SQL query to retrieve the updates can be found in appendix 6.1.
First, the unedited data will be described.
| Variable | Description |
|---|---|
| route_id | Unique route identifier |
| start_date | Start date of the trip |
| vehicle_id | Unique vehicle identifier |
| timestamp | Timestamp of the real time update |
| trip_id | Unique trip identifier |
| stop_sequence | Order of stops for a particular trip |
| stop_id | Unique stop identifier |
| delay | The current schedule deviation for the trip. The delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule) |
| arrival_time | Predicted arrival time for a stop on a particular trip |
It is important to note the following:
- One route_id can have many trip_ids.
- One trip_id occurs at most once a day, but the same trip_id can occur on multiple days.
- As a bus trip is occurring, the real time predictions of the arrival times at the remaining stops on the trip are updated at each timestamp.
Cleaning and wrangling this dataset proved to be the most challenging and time consuming section of this analysis, with many methodologies, cleaning iterations, and code trialed to arrive at the optimal treatment. This is due to the complex relationships between the observations in the dataset, and the variety of errors and inconsistencies encountered.
The following preliminary adjustments were done:
- As each stop on a given trip can have multiple arrival time predictions, one for each update timestamp prior to reaching that stop, the SQL query ensures that each stop only has the predicted arrival time corresponding to the latest timestamp, given that the later the prediction, the more accurate it is.
- As a trip can begin and end outside the bounds of the segment, the updates were constrained to only those stops within the segment, in either direction.
- Weekends and holidays were removed, as we are interested in the relationships during working days.
A new variable to_stop_time was created. This variable measures the time taken to reach each stop from the prior stop in seconds, within each trip. The variable was created to facilitate a potential deeper understanding of the data, to highlight any errors, and for potential utilities in the future such as drilling down to examine the pattern on a stop-basis.
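The latest-prediction rule and the to_stop_time derivation described above can be sketched as follows. This is a minimal illustration rather than the report's code: the report applies the first step in SQL (appendix 6.1), and the function names, the sample rows, and the choice to hold timestamps and arrival times as epoch seconds are all assumptions made for the example.

```python
from collections import defaultdict


def latest_predictions(updates):
    """Keep one row per (start_date, trip_id, stop_id): the one with the latest timestamp."""
    latest = {}
    for row in updates:
        key = (row["start_date"], row["trip_id"], row["stop_id"])
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


def add_to_stop_time(rows):
    """Add to_stop_time: seconds between a stop's arrival and the previous stop's, per trip."""
    trips = defaultdict(list)
    for row in rows:
        trips[(row["start_date"], row["trip_id"])].append(row)

    result = []
    for trip_rows in trips.values():
        trip_rows.sort(key=lambda r: r["stop_sequence"])
        previous_arrival = None
        for row in trip_rows:
            # The first stop of a trip has no prior stop, so its value is None.
            gap = None if previous_arrival is None else row["arrival_time"] - previous_arrival
            result.append({**row, "to_stop_time": gap})
            previous_arrival = row["arrival_time"]
    return result


# Hypothetical updates; times are epoch seconds (an assumption for this sketch).
updates = [
    {"start_date": "2022-03-01", "trip_id": "T1", "stop_sequence": 1, "stop_id": "A",
     "timestamp": 1000, "arrival_time": 1100},
    {"start_date": "2022-03-01", "trip_id": "T1", "stop_sequence": 2, "stop_id": "B",
     "timestamp": 1000, "arrival_time": 1190},
    {"start_date": "2022-03-01", "trip_id": "T1", "stop_sequence": 2, "stop_id": "B",
     "timestamp": 1150, "arrival_time": 1220},  # later update supersedes the row above
    {"start_date": "2022-03-01", "trip_id": "T1", "stop_sequence": 3, "stop_id": "C",
     "timestamp": 1150, "arrival_time": 1210},  # earlier than stop B: an error to flag
]

cleaned = add_to_stop_time(latest_predictions(updates))
suspect = [r for r in cleaned if r["to_stop_time"] is not None and r["to_stop_time"] < 0]
```

Here `suspect` collects the rows with a negative to_stop_time, the kind of entry that the error analysis then has to deal with.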
Through this variable, a range of errors were discovered that needed to be amended. This is how the data appears before any remedial actions are taken.
Figure 4.3: Unedited to-stop times contain negative values
Figure 4.3 shows that to_stop_time contains negative values to the left of the red line. This is a clear error, as the time taken to reach a stop cannot be negative. Additionally, clusters of very high delay values can be seen above 4,000 seconds, which is over an hour.
In total, there were eight types of errors identified in the data. The list of errors, an example of each error, and the code to rectify the errors can be found in appendix 6.2.
Great effort was put into identifying each type of error and remedying it in a way that neither produces further errors nor removes large amounts of data. Identifying the correct order in which to tackle the types of errors was also essential, and formulating the code to fix each error required several trial and error iterations. This was all done to ensure the errors were removed as surgically as possible, both to minimize data loss and because of the sensitive nature of the relationships between the stops on each trip.
The percentage of error entries located and fixed in the data was 3.82%. The cleaned data now appears as follows:
Figure 4.4: Cleaned to-stop times do not contain negative values
With the data now cleaned, two additional variables were created called first_stop and last_stop, which identify the first and last stops of each trip within the segment. The total time per trip can now be derived by calculating the time between the first stop and last stop of the trip within the segment. The arrival times at the first stop and last stop on the segment will be regarded as the start and end time, respectively, of the trip. The distribution of the trip times per direction is shown below, using the two most frequently occurring first-last stop pairs per direction.
Figure 4.5: Different stops pairs in the same direction have different trip times
As figure 4.5 shows, different first-last stop pairs in the same direction have different travel times. This means that different routes and trips can have different travel times solely based on their respective first and last stops on the segment. This renders their travel times incomparable, as they cover different distances. Therefore, only trips with the same pair of first and last stops within the segment will be kept, with the remaining trips discarded; there can be only one pair of first and last stops per direction, so that the distance is constant for all the trips and the times are therefore comparable.
This pair of stops is identified as the most frequently occurring pair per direction. Now, only trips with this pair of first and last stops are kept in the data. The stop pair per direction can be seen in the map below.
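Selecting the most frequent first-last stop pair per direction and discarding all other trips can be sketched as below. The function names, the stop identifiers, and the one-dictionary-per-trip layout are hypothetical; only the rule itself comes from the text above.

```python
from collections import Counter


def most_common_pairs(trips):
    """Return {direction: (first_stop, last_stop)} for the most frequent pair per direction."""
    counts = {}
    for trip in trips:
        pair = (trip["first_stop"], trip["last_stop"])
        counts.setdefault(trip["direction"], Counter())[pair] += 1
    return {direction: counter.most_common(1)[0][0] for direction, counter in counts.items()}


def keep_most_common_pair(trips):
    """Keep only the trips whose stop pair is the most frequent one in their direction."""
    pairs = most_common_pairs(trips)
    return [t for t in trips if (t["first_stop"], t["last_stop"]) == pairs[t["direction"]]]


# Hypothetical trips: one row per trip, with its first and last stop on the segment.
trips = [
    {"trip_id": "T1", "direction": "NB", "first_stop": "S10", "last_stop": "S40"},
    {"trip_id": "T2", "direction": "NB", "first_stop": "S10", "last_stop": "S40"},
    {"trip_id": "T3", "direction": "NB", "first_stop": "S20", "last_stop": "S40"},
    {"trip_id": "T4", "direction": "SB", "first_stop": "S41", "last_stop": "S11"},
    {"trip_id": "T5", "direction": "SB", "first_stop": "S41", "last_stop": "S11"},
    {"trip_id": "T6", "direction": "SB", "first_stop": "S41", "last_stop": "S21"},
]

comparable = keep_most_common_pair(trips)
```

If two pairs are equally frequent, `Counter.most_common` returns the one encountered first, so a real implementation may want an explicit tie-break.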
Figure 4.6: Most occurring pair of bus stops per direction
The distribution of the trip times per direction is shown below.
Figure 4.7: Excessively large trip times exist, especially southbound
As figure 4.7 shows, excessive trip times occur, especially southbound. It is difficult to determine whether these are errors or genuine trip times without further information. A variable called delay_diff is therefore created, which calculates the size of the difference between the delay at the first stop and the delay at the last stop of each trip. Excessive values of this variable indicate that the large travel time is due to an error, as one of the two stops has an artificially large delay or early arrival. A plot of delay_diff vs travel_time is shown below.
Figure 4.8: Size of difference between first and last stop delays. Southbound is more problematic
Based on figure 4.8, trips with a delay_diff greater than 600 seconds (10 minutes) were removed as they were most likely errors. The resulting data now appears as follows:
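The delay_diff filter can be sketched as follows, reading "size of the difference" as an absolute difference in seconds. The function names, field names, and sample trips are hypothetical.

```python
MAX_DELAY_DIFF = 600  # seconds (10 minutes), the threshold quoted in the text


def delay_diff(first_delay, last_delay):
    """Absolute difference between the delays at the first and last stop of a trip."""
    return abs(last_delay - first_delay)


def drop_suspect_trips(trips, threshold=MAX_DELAY_DIFF):
    """Remove trips whose delay_diff is greater than the threshold."""
    return [t for t in trips if delay_diff(t["first_delay"], t["last_delay"]) <= threshold]


# Hypothetical trips with the delay (in seconds) at their first and last stop.
trips = [
    {"trip_id": "T1", "first_delay": 30, "last_delay": 90},    # diff 60: kept
    {"trip_id": "T2", "first_delay": -20, "last_delay": 700},  # diff 720: removed
    {"trip_id": "T3", "first_delay": 650, "last_delay": 40},   # diff 610: removed
    {"trip_id": "T4", "first_delay": 0, "last_delay": 600},    # diff 600: kept (not greater)
]

kept = drop_suspect_trips(trips)
```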
Figure 4.9: Excessively large trip times no longer exist
Figure 4.9 shows that travel times southbound away from the city are larger and more dispersed than the travel times northbound towards the city.
Figure 4.10: Greater variation exists between the travel times of both directions in the evening
Figure 4.10 shows that travel times are in fact very similar during the morning rush hour, while in the evening, travel time southbound away from the city is longer as expected.
The data relating to only the first stop per trip was kept since the arrival time of the first stop will be considered as the trip start time used as the basis for aggregation later in the analysis.
The bus travel times will be split into five minute time periods, with the arrival time at the first stop on the trip used as the basis for this split. For example, all bus trips that start between 2022-03-01 12:00:00 and 2022-03-01 12:05:00 will be included in the same time frame. Since each time frame can contain multiple trips, the bus travel times will be averaged into one average bus travel time. This is done to establish a one-to-one relationship with the vehicle travel times, which are also in five minute intervals. The final dataset looks as follows:
| Variable | Description |
|---|---|
| day | Date of measurement |
| time | Time of the day in hour:minute:seconds of the measurement |
| hour | The hour of the measurement |
| rush | Whether the measurement occurs during rush hour: morning (between 6am and 10am), evening (between 3pm and 7pm), or neither |
| direction | The direction of travel |
| number_buses | The original number of trips during the five minute interval |
| bus_time | The average bus trip travel time across the segment during the five minute interval |
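The five minute aggregation that produces this dataset can be sketched as follows. The function names and sample trips are hypothetical, and two boundary conventions are assumptions not stated in the text: each interval is treated as half-open (a trip starting at exactly 12:05:00 falls in the 12:05 interval), and the rush hour windows include their start hour and exclude their end hour.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_five_minutes(ts):
    """Round a datetime down to the start of its five minute interval."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def rush_label(ts):
    """Classify a datetime as morning rush (6am-10am), evening rush (3pm-7pm), or neither."""
    if 6 <= ts.hour < 10:
        return "morning"
    if 15 <= ts.hour < 19:
        return "evening"
    return "neither"


def aggregate_bus_times(trips):
    """Average trip travel times per five minute interval and direction."""
    groups = defaultdict(list)
    for trip in trips:
        key = (floor_to_five_minutes(trip["start_time"]), trip["direction"])
        groups[key].append(trip["travel_time"])

    rows = []
    for (start, direction), times in sorted(groups.items()):
        rows.append({
            "day": start.date(), "time": start.time(), "hour": start.hour,
            "rush": rush_label(start), "direction": direction,
            "number_buses": len(times), "bus_time": sum(times) / len(times),
        })
    return rows


# Hypothetical trips: start_time is the arrival at the first stop, travel_time in seconds.
trips = [
    {"start_time": datetime(2022, 3, 1, 12, 1, 30), "direction": "NB", "travel_time": 300},
    {"start_time": datetime(2022, 3, 1, 12, 4, 59), "direction": "NB", "travel_time": 360},
    {"start_time": datetime(2022, 3, 1, 12, 5, 0), "direction": "NB", "travel_time": 420},
    {"start_time": datetime(2022, 3, 1, 17, 32, 0), "direction": "SB", "travel_time": 600},
]

bus_times = aggregate_bus_times(trips)
```

Keeping number_buses alongside the average makes it possible to see later how many trips stand behind each five minute value.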
DIT Addinsight is traffic information collected by DIT through Bluetooth devices that tag a Bluetooth-equipped vehicle when it comes into range. The location of a Bluetooth device is called a site, and a link is a segment of road between two sites, an origin site and a destination site. This allows for the calculation of metrics such as the time taken to travel through the link, among others.
The DIT Addinsight database is very large and contains many tables, each recording its own set of information, with foreign keys connecting most tables. This analysis uses only a subset of the tables in the database, presented below.
The holidays dataset contains holiday dates; the date is the only variable used.
The links dataset lists all the links present in the network, not just those on the segments of interest. A link is a one-way section of road between two adjacent sites. Addinsight measures statistics for every link in real time. The variables present in and used from this dataset are:
| Variable | Description |
|---|---|
| dms_update_ts | Database update datetime |
| id | Unique link identifier |
| name | Description of link |
| originid | The Bluetooth site ID that begins the link |
| destid | The Bluetooth site ID that ends the link |
| enabled | Boolean. Disabled links do not generate statistics |
| length | The link length in metres |
| direction | Link direction of travel |
These are only the links that are present on the road segment examined. They were identified by inserting all the links into an interactive map in Tableau, and then manually selecting the area of interest on the map, which produces a list of the links in the highlighted area and the geometry for each link. It is important to note that some links can overlap with other links on the segment.
| Variable | Description |
|---|---|
| linkid | Unique link identifier |
| ordernumber | Order of geometry coordinates |
| latitude | Latitude |
| longitude | Longitude |
The links data was limited to the end of the period examined, in this case the end of March 2022. The data was filtered to the most recent update per link, in order to obtain the most recent enabled status of each link. The two links datasets were joined together to create a single dataset which contains the links on the segment and all their related information. Finally, the name variable was used to create two additional variables, start_loc and end_loc, which give the name of the link start location and end location, respectively. These variables will be used to identify the sequence of non-overlapping links on the segment later.
With the bus travel times now obtained, the objective is to obtain the motor vehicle travel times across approximately the same length of the segment as the length between the bus stops pairs identified for each direction. The travel times statistics are generated by each enabled link.
As stated in the previous section, the list of links exported from Tableau contains all the links on the segment. This includes links that overlap with each other, where a portion of a link is simultaneously covered by another link. This overlap can result in double counting when calculating the aggregated statistics of the links, as the overlapping links can measure the same motor vehicles at the same time. It is therefore essential to identify and select a sequence of non-overlapping links per direction that covers the required length of the segment.
Identifying the sequence of non-overlapping links also required multiple time-intensive trials and iterations. The method to identify the links, as with the analysis in general, attempts to ensure maximum automation and minimal manual input. Each direction of travel will be processed separately using the same method. The northbound direction will be used here as an illustration.
First, the links dataset is ordered from south to north using the starting latitude of each link. The dataset will also be filtered to enabled links, as disabled links do not generate statistics. Next, the names of the locations of the start and end points of the segment will be entered. These are obtained by examining the stops map in figure 4.6, since both metrics need to cover as much of the same segment as possible. In this case, the segment starts at the intersection of South Road with Walsh Avenue and ends at the intersection of South Road with Anzac Highway. However, the link ending in Walsh Avenue is disabled, so the intersection prior, which is Celtic Avenue, will be used as the starting point for now, and the statistics for Walsh Avenue will be derived later. Entering the names of the start and end points is the only manual step in the code.
Once the locations are determined, the links dataset will be filtered to contain only the links that occur between the first occurrence of the start location and the last occurrence of the end location. If two links start at the same location, the shorter link will be chosen as the data is more granular.
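One way to read the selection rule above is as a walk along the road: start at the start location, take the shortest enabled link that begins there, move to that link's end location, and repeat until the end location is reached. The sketch below implements that reading. It is an interpretation rather than the report's code, and the function name, the intermediate location names ("Intersection A", "Intersection B"), and all lengths are hypothetical.

```python
def non_overlapping_sequence(links, start_loc, end_loc):
    """Chain enabled links from start_loc to end_loc, preferring the shortest link at each start."""
    shortest_from = {}
    for link in links:
        if not link["enabled"]:
            continue  # disabled links do not generate statistics
        best = shortest_from.get(link["start_loc"])
        if best is None or link["length"] < best["length"]:
            shortest_from[link["start_loc"]] = link

    sequence, current, visited = [], start_loc, set()
    while current != end_loc:
        if current in visited or current not in shortest_from:
            raise ValueError(f"no enabled link continues from {current!r}")
        visited.add(current)
        link = shortest_from[current]
        sequence.append(link)
        current = link["end_loc"]
    return sequence


# Hypothetical northbound links; lengths are in metres.
links = [
    {"id": 1, "start_loc": "Celtic Avenue", "end_loc": "Intersection A", "length": 500, "enabled": True},
    {"id": 2, "start_loc": "Celtic Avenue", "end_loc": "Intersection B", "length": 1200, "enabled": True},
    {"id": 3, "start_loc": "Intersection A", "end_loc": "Intersection B", "length": 700, "enabled": True},
    {"id": 4, "start_loc": "Intersection B", "end_loc": "Anzac Highway", "length": 900, "enabled": True},
    {"id": 5, "start_loc": "Intersection A", "end_loc": "Anzac Highway", "length": 1600, "enabled": True},
    {"id": 6, "start_loc": "Intersection A", "end_loc": "Intersection B", "length": 300, "enabled": False},
]

northbound = non_overlapping_sequence(links, "Celtic Avenue", "Anzac Highway")
```

Because each chosen link starts where the previous one ends, no stretch of road is covered twice, and the only manual inputs remain the two location names.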
The sequence of non-overlapping links, along with the stop pair in the northbound direction, is shown on the map below:
Figure 4.11: Sequence of non-overlapping links in the northbound direction and stops. The alternating colors show the non-overlapping property of the links
The same process will be applied to the southbound direction.
Figure 4.12: Sequence of non-overlapping links and stops in the southbound direction
The map displaying the links from both directions is shown below:
However, as stated previously, the true limit of the segment should be at Walsh Avenue. This gap can be seen at the southern end of the map of both directions. Since the link connecting Celtic Avenue to Walsh Avenue is disabled in both directions and statistics were not collected, the statistics will have to be imputed.
The imputation method starts by identifying the linkid of the missing link(s), as well as the closest enabled links preceding and following the missing link(s). This process is completely automated with the exception of the manual specification of the start and end locations of the missing links. The process is implemented for both directions.
With both the links that make up the segment and the links needed for imputation identified, the link statistics for the time period are retrieved from AWS through Athena using the SQL query found in appendix 6.3.
These are the aggregated five minute statistics generated for each link. The dataset is as follows:
| Variable | Description |
|---|---|
| logtime | Current interval timestamp |
| linkid | Unique link identifier |
| tt | Travel time in seconds |
Before continuing with the imputation process of the missing links, the dataset will be cleaned and validated.
Holidays and weekends are excluded as with the bus travel times, and the dataset is joined with the links dataset to retrieve the direction and length of each link. Finally, the speed will be calculated by dividing the length of the link by the link travel time, and is adjusted to be in kilometers per hour. This is done for error detection.
A plot of travel speed vs travel time is shown below:
Figure 4.13: Very large speed values are present; these are judged to be clear errors
Figure 4.13 shows speed vs travel time, with the points colored by link. We can see that excessively large speed values are present in the southbound direction. Since the length of each link does not change, this indicates that errors were made when the travel times were logged. Observations with speeds over 150 km/h will be removed from the dataset.
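The speed check can be sketched as follows: metres per second are converted to kilometers per hour by multiplying by 3.6, and observations above 150 km/h are dropped. The function names and sample rows are hypothetical, and skipping non-positive travel times is an added safeguard that the text does not mention.

```python
def speed_kmh(length_m, travel_time_s):
    """Convert a link length in metres and a travel time in seconds to km/h."""
    return length_m / travel_time_s * 3.6


def drop_implausible_speeds(rows, max_kmh=150.0):
    """Remove observations whose implied speed exceeds max_kmh."""
    kept = []
    for row in rows:
        if row["tt"] <= 0:
            continue  # assumption: a non-positive travel time cannot yield a valid speed
        if speed_kmh(row["length"], row["tt"]) <= max_kmh:
            kept.append(row)
    return kept


# Hypothetical five minute link statistics joined with the link length in metres.
rows = [
    {"linkid": 1, "length": 500, "tt": 30},   # 60 km/h: kept
    {"linkid": 1, "length": 500, "tt": 5},    # 360 km/h: removed
    {"linkid": 2, "length": 1200, "tt": 0},   # invalid travel time: removed
    {"linkid": 2, "length": 1200, "tt": 90},  # 48 km/h: kept
]

plausible = drop_implausible_speeds(rows)
```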
Figure 4.14: Erroneous speed values were removed
A plot of the length in meters vs the travel time will be examined:
Figure 4.15: Travel time increases with length
Figure 4.15 shows that travel time increases as the link length increases, which is an accepted and expected result.
With the link statistics now cleaned and validated, we return to the process of imputing the statistics for the missing links shown in the map below. The green links are the links with statistics, while the red links are the links with missing statistics.
Figure 4.16: Links with statistics are colored in green, links with missing statistics are colored in red
The following steps are performed to impute the travel times of the missing links. For each link:
1. For the preceding and following links, calculate the travel time divided by the length for each five minute logtime
2. Average the results of (1) between the preceding and following links for each five minute logtime
3. Multiply the result of (2) by the length of the missing link to obtain the travel time for each five minute logtime
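The three imputation steps can be sketched as below. The function name, link lengths, and travel times are hypothetical, and imputing only for logtimes where both neighbouring links have a value is an assumption made for the example.

```python
def impute_missing_link(preceding, following, preceding_len, following_len, missing_len):
    """Impute travel times (seconds) for a link without statistics.

    preceding / following map each five minute logtime to the travel time of the
    nearest enabled link before / after the missing link; lengths are in metres.
    """
    imputed = {}
    # Assumption: impute only where both neighbouring links report a travel time.
    for logtime in preceding.keys() & following.keys():
        pace_before = preceding[logtime] / preceding_len  # step 1: seconds per metre
        pace_after = following[logtime] / following_len
        mean_pace = (pace_before + pace_after) / 2        # step 2: average the two paces
        imputed[logtime] = mean_pace * missing_len        # step 3: scale to the missing length
    return imputed


# Hypothetical values: both neighbouring links are 600 m long, the missing link 400 m.
preceding = {"07:00": 60, "07:05": 90}
following = {"07:00": 90, "07:05": 120, "07:10": 100}

imputed = impute_missing_link(preceding, following, 600, 600, 400)
```

With these numbers the 07:00 interval has paces of 0.1 and 0.15 seconds per metre, which average to 0.125 and give an imputed travel time of about 50 seconds for the 400 m link.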
To check that links with missing statistics no longer exist, the links are plotted once again on the map below.
Figure 4.17: The links are all green indicating that there are no longer missing statistics
As the links in figure 4.17 are all green, travel time figures exist for all the links.
Finally, to obtain the total motor vehicle travel time across the segment, the travel times of all the links in the same five minute logtime will be summed, per direction. The final dataset for analysis contains the following variables:
| Variable | Description |
|---|---|
| day | Date of measurement |
| time | Time of the day in hour:minute:seconds of the measurement |
| hour | The hour of the measurement |
| rush | Whether the measurement occurs during rush hour: morning (between 6am and 10am), evening (between 3pm and 7pm), or neither |
| direction | The direction of travel |
| links_time | The total travel time across the segment |
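Summing the link travel times into links_time can be sketched as follows. The function name and sample rows are hypothetical. Dropping intervals in which one of the links has no value is an assumption added for the example, since a partial sum would understate the travel time across the segment; the text does not say how such intervals are handled.

```python
from collections import defaultdict


def segment_travel_times(link_rows, links_per_direction):
    """Sum link travel times per (logtime, direction), keeping only complete intervals."""
    totals = defaultdict(float)
    seen = defaultdict(set)
    for row in link_rows:
        key = (row["logtime"], row["direction"])
        totals[key] += row["tt"]
        seen[key].add(row["linkid"])
    # Assumption: an interval counts only if every link in that direction reported a value.
    return {
        key: total for key, total in totals.items()
        if len(seen[key]) == links_per_direction[key[1]]
    }


# Hypothetical statistics for a northbound sequence made up of three links.
link_rows = [
    {"logtime": "2022-03-01 07:00", "direction": "NB", "linkid": 1, "tt": 60},
    {"logtime": "2022-03-01 07:00", "direction": "NB", "linkid": 3, "tt": 80},
    {"logtime": "2022-03-01 07:00", "direction": "NB", "linkid": 4, "tt": 100},
    {"logtime": "2022-03-01 07:05", "direction": "NB", "linkid": 1, "tt": 65},
    {"logtime": "2022-03-01 07:05", "direction": "NB", "linkid": 3, "tt": 85},
]

links_time = segment_travel_times(link_rows, {"NB": 3})
```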
The distribution of the motor vehicle travel times per direction during the rush hours appears as follows:
Figure 4.18: Vehicle travel times behave as expected
Figure 4.18 shows that the vehicle travel times behave as expected, in that northbound travel time is longer in the morning, while southbound travel time is longer in the evening. However, it should be noted that the difference in travel times between the two directions is much greater in the evening than in the morning, a behaviour similar to that of the bus travel times seen in figure 4.10.
With all the data now complete, the analysis between the two travel times is ready to be performed.
The travel times from both sets will be compared against one another. This is done to gain a general understanding of the relationship as well as to validate the datasets, as we would expect to observe a similar pattern between both travel times. The comparison will be done through a series of graphs.
Figure 5.1: Vehicles are faster in both directions. Distributions of both types resemble each other
Figure 5.1 shows that for northbound travel to the city, vehicle travel time largely remains the same during both rush hour periods, while bus travel time actually increases in the evening, a surprising result. For southbound travel away from the city, both travel times increase in the evening as expected. Bus travel times are generally longer than vehicle travel times, and both types exhibit similar patterns overall.
Figure 5.2: Northbound bus travel times are longer than vehicle travel times. Both exhibit similar patterns
Figure 5.3: Southbound bus travel times are longer than vehicle travel times. Both exhibit similar patterns
Figures 5.2 and 5.3 both show that the travel times of the two types generally follow a similar pattern. This indicates that the data in both datasets are valid, as we would not expect to see very different patterns. Buses are almost always slower than other vehicles, as buses need to load and unload passengers at the various bus stops along the road, and they accelerate more slowly since they are heavy vehicles. We also notice that towards the end of the day, in both directions, the travel times seem to level off at a low value. This is likely due to less traffic being present on the road in the evening, leading to a faster traversal of the segment, with only constant factors such as the speed limit and traffic lights affecting the travel time.
A scatter plot of the travel times will be examined:
Figure 5.4: A positive relationship exists between both travel times for both directions during peak times
Figure 5.4 shows that a positive relationship exists between the travel times, more so for the southbound direction in the evening.
The correlations between the travel times are:
| Rush | Direction | Correlation |
|---|---|---|
| Morning | NB | 0.80 |
| Morning | SB | 0.49 |
| Evening | NB | 0.67 |
| Evening | SB | 0.87 |
Table 5.1 shows that the travel times of buses and vehicles are highly correlated in the morning northbound towards the city, and in the evening southbound away from the city.
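A table of this kind can be produced by pairing the bus and vehicle travel times on their shared five minute interval and computing a correlation per rush period and direction. The sketch below assumes the Pearson coefficient, which the text does not name explicitly; the function names and sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_by_group(bus_rows, vehicle_rows):
    """Pair rows on (day, time, direction) and correlate per (rush, direction)."""
    vehicle = {(r["day"], r["time"], r["direction"]): r["links_time"] for r in vehicle_rows}
    paired = {}
    for row in bus_rows:
        key = (row["day"], row["time"], row["direction"])
        if key in vehicle:  # keep only intervals present in both datasets
            xs, ys = paired.setdefault((row["rush"], row["direction"]), ([], []))
            xs.append(vehicle[key])
            ys.append(row["bus_time"])
    return {group: pearson(xs, ys) for group, (xs, ys) in paired.items()}


# Hypothetical evening southbound intervals; travel times are in seconds.
vehicle_rows = [
    {"day": "2022-03-01", "time": "17:00", "direction": "SB", "links_time": 200},
    {"day": "2022-03-01", "time": "17:05", "direction": "SB", "links_time": 250},
    {"day": "2022-03-01", "time": "17:10", "direction": "SB", "links_time": 300},
]
bus_rows = [
    {"day": "2022-03-01", "time": "17:00", "direction": "SB", "rush": "evening", "bus_time": 450},
    {"day": "2022-03-01", "time": "17:05", "direction": "SB", "rush": "evening", "bus_time": 550},
    {"day": "2022-03-01", "time": "17:10", "direction": "SB", "rush": "evening", "bus_time": 650},
    {"day": "2022-03-01", "time": "17:15", "direction": "SB", "rush": "evening", "bus_time": 700},
]

correlations = correlation_by_group(bus_rows, vehicle_rows)
```

The last bus row has no matching vehicle interval and is left out of the pairing, which mirrors the one-to-one relationship between the two five minute datasets.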
To gain a clearer picture of the patterns and relationship throughout an average day, the travel times within 30 minute aggregates of the same time frame will be averaged across all the days. For example, for each travel type, all the measures occurring between 12:00 and 12:30 across all the days will be averaged, then plotted.
Figure 5.5: Average travel time patterns by vehicle and direction. Peak times are highlighted
Figure 5.5 shows the average pattern of travel times across the day, by direction and type. The morning and evening rush hours have been highlighted as they are the parts of the day worthy of examination. For northbound travel towards the city, both rush hours display a similar level of travel time for both types, and the travel times in the rush hours are not much greater than those outside them. This is an unexpected result, as travel times northbound towards the city would be expected to be higher in the morning rush hour. Southbound travel away from the city, however, follows expectations, as the travel time for both types increases dramatically in the evening rush hour as workers leave the city.
The goal of the analysis is to ascertain the extent of the relationship between the variations in the motor vehicle travel times and the variations in the bus trip travel times, where variation is measured relative to travel times during the same time frame across the entire period. In other words, if the vehicle travel time varies by a certain level relative to the usual travel time during the same time frame, can we observe a reflection of this variation in the bus travel time? If so, by how much?
In order to assess the variation, the travel times will be standardized. The function standardiser is created which separately standardizes both the bus travel times and vehicle travel times according to the total data in the entire period based on either:
- the five minute time frame. For example, a bus/vehicle travel time on 2022-03-01 at 7:00am would be standardized against all the other travel times that occur at 7:00am in the period
- the hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 between 7am and 8am would be standardized against all the other travel times in the period that occur during that hour
- the rush hour of travel. For example, a bus/vehicle travel time that occurs on 2022-03-01 during the morning rush hour would be standardized against all the other travel times in the period that occur during the morning rush hour
These options are provided to the function as an argument (“time”, “hour”, “rush”). As the time frame widens, more data is available for standardization. This is why, in addition to standardizing the data, the standardiser function also stores the total number of data points present in each time frame according to the method chosen. The function also removes observations more than three standard deviations from the mean, as these are considered outliers that can affect the analysis.
Ideally, the travel times would be standardized according to the same five minute time frame across the entire period, as this would provide the highest accuracy. However, as the bus trips per five minute time frame were averaged into one five minute travel time, and we are analyzing only one month of data containing 21 working days, there are not enough travel times to accomplish this, since there would be a maximum of 21 data points per five minute time frame available for standardization. Instead, the default standardization parameter is by hour, which provides a much greater number of data points.
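The behaviour described for the standardiser function can be sketched as below. This is an illustrative re-creation in Python, not the report's function: the name `standardise`, its signature, the use of the sample standard deviation, and the grouping by direction as well as by time frame are all assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardise(rows, value, by="hour", max_abs_z=3.0):
    """Standardise `value` within each (direction, time frame) group.

    `by` is "time", "hour", or "rush". Each returned row gains "z" (the standardised
    value) and "n" (the number of data points in its group). Rows whose |z| exceeds
    max_abs_z are dropped as outliers.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["direction"], row[by])].append(row[value])

    result = []
    for row in rows:
        values = groups[(row["direction"], row[by])]
        if len(values) < 2:
            continue  # a standard deviation needs at least two observations
        spread = stdev(values)
        if spread == 0:
            continue  # identical values carry no variation to standardise
        z = (row[value] - mean(values)) / spread
        if abs(z) <= max_abs_z:
            result.append({**row, "z": z, "n": len(values)})
    return result


# Hypothetical northbound bus times in the 7am hour: twenty ordinary values and one outlier.
rows = [{"direction": "NB", "hour": 7, "bus_time": 300 + i} for i in range(20)]
rows.append({"direction": "NB", "hour": 7, "bus_time": 2000})

standardised = standardise(rows, "bus_time", by="hour")
```

Storing `n` with every row shows why the hourly default is attractive: an hour contains up to twelve five minute intervals per working day, whereas a single five minute time frame contributes at most one value per day.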
With the bus and vehicle travel times standardized, we can now examine the relationship between the travel times with respect to variation. If the vehicle travel time deviates from the average, do we observe a similar deviation by the bus travel time?
Plots of the standardized travel times are shown below:
Figure 5.6: Northbound morning travel time variations are similar
Figure 5.7: Southbound evening travel time variations are similar
Figure 5.6 and figure 5.7 show that variations in vehicle travel times are in fact closely matched by variations in bus travel times, especially in the morning northbound towards the city, where the variations are also greater in magnitude.
Figure 5.8: A positive relationship exists between both travel times for both directions during peak times
Figure 5.8 shows that the standardized travel times are particularly correlated in the evening southbound away from the city.
The correlations between the standardized travel times are:
| Rush | Direction | Correlation |
|---|---|---|
| Morning | NB | 0.67 |
| Morning | SB | 0.22 |
| Evening | NB | 0.44 |
| Evening | SB | 0.74 |
Table 5.2 shows that strong correlation exists between the standardized travel times during the evening southbound away from the city.
The following plot increases the granularity of the correlation to a per hour figure:
Figure 5.9: High correlation is more uniform in the evening southbound
Figure 5.9 shows that high correlation exists in the morning northbound towards the city between 7am and 9am, while in the evening southbound away from the city, high correlation exists throughout the rush hour, peaking between 6pm and 7pm.
These trips were removed as they will be in the wrong timeframe when analyzing against the vehicle travel time.